Readme

require(kableExtra)
require(gt)
require(gtsummary)
require(tidyverse)
require(adegenet)
require(knitr)
require(magrittr)

This is document is an R notebook. If you’d like view to pre-rendered figures, read a summary of analysis and interact with code, please open the relevant html file in a browser.

To conduct a similar analyses on your computer, edit or run code: clone this repository into a directory on your local machine and open the .Rproj file in Rstudio.

The project repository is archived at github

Rationale

  • Subset the Oregon Coast Chinook Dataset to briefly examine differentiation between North and South Umpqua spring run Chinook Salmon

Data

Data Sources
Raw data included 1186 individuals from 11 river basins genotyped using the SFGL Ots353 GTseq Panel. After filtering, 977 individuals genotyped at 324 markers remained.

The full genotyping and filtering log is available in the project repository, here

Import Data

#import filtered genotype data
load("../project_genotyping/filtered_genotype_data/genind_2.0.R")
load("../project_genotyping/filtered_genotype_data/genos_2.2.R")

meta_data <- readxl::read_xlsx("../metadata/consolidated_metadata/combined_run_info.xlsx", sheet = 5)

Data Summary

Let’s filter the dataset to only include Umpqua spring Chinook salmon.

First, we’ll summarise what was sampled (not filtered)

kable(meta_data %>% 
        mutate(carcass = case_when(str_detect(Sample, "OtsCC") ~ "carcass",
                                   TRUE ~ "live")) %>%
        mutate(location = str_to_lower(location)) %>%
        filter(Population %in% c("NUMP", "SUMP", "UMPR"), run == "Spring") %>% 
        count(Population, stream, carcass, location), caption = "Umpqua River sample sizes in the before genotype quality filtering") %>%
  kable_classic(full_width = F, html_font = "Cambria")
Umpqua River sample sizes in the before genotype quality filtering
Population stream carcass location n
NUMP N. Umpqua River carcass boulder to copeland 7
NUMP N. Umpqua River carcass full flow reach 7
NUMP N. Umpqua River carcass lower fish creek 11
NUMP N. Umpqua River carcass slide cr bypass reach 2
NUMP N. Umpqua River carcass soda springs to boulder cr 22
NUMP N. Umpqua River carcass upstream of soda springs 1
NUMP N. Umpqua River live winchester dam or rock cr hatchery 53
SUMP S. Umpqua River carcass 2823 bridge 1
SUMP S. Umpqua River carcass bedrock gorge 2
SUMP S. Umpqua River carcass na 3
SUMP S. Umpqua River carcass narrows 1
SUMP S. Umpqua River carcass skillet cr cg 3
SUMP S. Umpqua River live south umpqua falls 59

Now let’s summarize the data after filtering

kable(genos_2.2 %>% 
        mutate(carcass = case_when(str_detect(sample, "OtsCC") ~ "carcass",
                                   TRUE ~ "live")) %>%
        mutate(location = str_to_lower(location)) %>%
        filter(Population %in% c("NUMP", "SUMP"), run == "Spring") %>% 
        count(Population, stream, carcass, location), caption = "Sample sizes after filtering") %>%
  kable_classic(full_width = F, html_font = "Cambria")
Sample sizes after filtering
Population stream carcass location n
NUMP N. Umpqua River carcass boulder to copeland 2
NUMP N. Umpqua River carcass full flow reach 1
NUMP N. Umpqua River carcass lower fish creek 2
NUMP N. Umpqua River carcass soda springs to boulder cr 2
NUMP N. Umpqua River carcass upstream of soda springs 1
NUMP N. Umpqua River live winchester dam or rock cr hatchery 49
SUMP S. Umpqua River carcass bedrock gorge 2
SUMP S. Umpqua River carcass fs-2838 bridge 1
SUMP S. Umpqua River carcass narrows 1
SUMP S. Umpqua River carcass skillet cr cg 2
SUMP S. Umpqua River carcass NA 2
SUMP S. Umpqua River live south umpqua falls 58

Also, let’s provide a simple, top level summary.

kable(genos_2.2 %>% 
        mutate(carcass = case_when(str_detect(sample, "OtsCC") ~ "carcass",
                                   TRUE ~ "live")) %>%
        mutate(location = str_to_lower(location)) %>%
        filter(Population %in% c("NUMP", "SUMP"), run == "Spring") %>% 
        count(stream), caption = "Sample sizes after filtering") %>%
  kable_classic(full_width = F, html_font = "Cambria")
Sample sizes after filtering
stream n
N. Umpqua River 57
S. Umpqua River 66
#let's get just the samples we need
umpr_genos <- genos_2.2 %>%
  filter(Population %in% c("NUMP", "SUMP"), run == "Spring")

umpr_inds <- umpr_genos %>% 
  pull(sample)
genind_umpr <- genind_2.0[umpr_inds]

PCA

PCA (all samples)

As a first step let’s conduct a PCA.

# first scale an center dataet and impute missing data to meanallele frequency
X <- scaleGen(genind_umpr,  NA.method="mean")
#then run pca, keep all PCs
pca1 <- dudi.pca(X, scale = FALSE, scannf = FALSE, nf = 123)

snp_pcs <- pca1$li#[,c(1:kg)]

#now plot data
snp_pcs %<>%
  rownames_to_column(var="sample") %>%
  left_join(umpr_genos)

ggplot(data = snp_pcs)+geom_point(aes(Axis1, Axis2, color = Population), alpha = 0.5) +theme_classic()

ggplot(data = snp_pcs)+geom_point(aes(Axis2, Axis3, color = Population), alpha = 0.5) +theme_classic()

ggplot(data = snp_pcs)+geom_point(aes(Axis4, Axis5, color = Population), alpha = 0.5) +theme_classic()

# let's get rid of 
plotly::plot_ly(x=snp_pcs$Axis1, y=snp_pcs$Axis2, z=snp_pcs$Axis3, type="scatter3d", mode="markers", color=snp_pcs$Population, alpha = 0.8)
eigs <- as.data.frame(pca1$eig)

eigs %<>%
  rowid_to_column()

ggplot(data = eigs)+geom_bar(aes(rowid, `pca1$eig`), stat = "identity", position = "dodge")+theme_classic()+xlab("PCA axis")+ylab("eigenvalue")+ggtitle("Screeplot")

It looks like we have a few outlier individuals driving most of the variation along the primary axes. This can occur when there is little structure to the data and the PCA starts grabbing noise, however because this is a dataset with substantial LD, rare alleles in gene regions overrepresented in our dataset (e.g. Greb1L-Rock1 region) might produce a similar pattern. Let’s look more closely at the outliers to see if this is the case.

Since we know from a previous analysis which Greb1L-Rock1 region markers are in strong LD and which alleles are associated with spring or fall migration timing, we can quickly check if this is the case

Below we count the number of spring (TT), fall (AA), or heterozygous (TA) genotypes at marker Ots28_110073668 at both outlier and non-outlier samples

outlier_inds <- snp_pcs %>%
  filter(Axis1 > 5 | Axis2 > 10) %>%
  pull(sample)

umpr_genos %<>%
  mutate(outlier = case_when(sample %in% outlier_inds ~ "outlier",
                             TRUE ~ "not_outlier"))

umpr_genos %>%
  count(Ots28_11073668, outlier )

There are three individuals that have heterozygous genotypes at the Greb1L-Rock1 region and all three are outliers, but this can’t fully explain the outlier pattern because there are 7 total outliers.

Let’s go further. What markers are driving the first two axes.

kable(pca1$c1 %>%
  select(CS1) %>%
  slice_max(CS1, n = 40), caption = "top 20 markers loading on PCA axis 1") %>%
  kable_classic(full_width = F, html_font = "Cambria") %>%
  scroll_box(width = "500px", height = "200px")
top 20 markers loading on PCA axis 1
CS1
Ots28_11071377.C 0.1489704
Ots28_11072994.T 0.1489630
Ots28_11073102.A 0.1489630
Ots28_11073668.A 0.1489630
Ots28_11075348.A 0.1489630
Ots28_11075712.T 0.1489630
Ots28_11077016.T 0.1489630
Ots28_11077576.G 0.1489630
Ots37124-12277401.A 0.1489630
Ots37124-12281207.A 0.1489630
Ots28_11077172.A 0.1489176
Ots28_11164637.C 0.1394907
Ots37124-12310649.T 0.1394907
Ots28_11160599.G 0.1394459
Ots28_11095755.A 0.1394012
Ots28_11206740.T 0.1253808
Ots28_11201129.T 0.1253602
Ots28_11205423.A 0.1253602
Ots28_11070757.G 0.1122982
Ots37124-12272852.C 0.0909242
Ots37124-12270118.A 0.0875610
Ots37124-12267397.C 0.0847719
Ots28_11202190.T 0.0835105
Ots28_11062192.G 0.0813845
Ots28_11186543.A 0.0774366
Ots_tpx2-125.T 0.0697439
Ots_U2567-104.A 0.0589868
Ots_RAG3.T 0.0481807
Ots28_11210919.C 0.0447974
Ots_118205-61.T 0.0444900
Ots_117370-471.G 0.0424047
Ots_P450-288.G 0.0385130
Ots_HSP90B-100.C 0.0368698
Ots_BMP2-SNP1.T 0.0357583
Ots_110495-380.C 0.0355402
Ots_U5121-34.G 0.0351289
Ots_unk1104-38.T 0.0349621
Ots28_11202400.C 0.0340045
Ots28_11205993.C 0.0339713
Ots28_11207428.T 0.0332233
kable(pca1$c1 %>%
  select(CS2) %>%
  slice_max(CS2, n = 40), caption = "top 20 markers loading on PCA axis 2") %>%
  kable_classic(full_width = F, html_font = "Cambria") %>%
  scroll_box(width = "500px", height = "200px")
top 20 markers loading on PCA axis 2
CS2
Ots_CHI06048618_5222.G 0.1431457
Ots11_11925999.G 0.1322466
Ots4_64978818.C 0.1303041
Ots_u07-25_325.T 0.1233726
Ots_u211-85.C 0.1181128
Ots4_40942276.G 0.1123654
Ots7_51409415.T 0.1119642
Ots4_42378741.C 0.1108644
Ots2_42405643.T 0.1066928
Ots_nramp-321.G 0.1052772
Ots9_28975221.A 0.1037127
Ots7_50997124.G 0.1020991
Ots_FARSLA-220.G 0.0907195
Ots_GPH-318.T 0.0903188
Ots3_57055518.T 0.0899086
Ots17_1345774_C6.T 0.0888702
Ots17_1488679_C6.G 0.0888702
Ots17_885364.C 0.0876775
Ots29_23344676.C 0.0868057
Ots_113457-40R.C 0.0854005
Ots18_32088284.T 0.0840288
Ots_u07-17_135.A 0.0834243
Ots_ppie-245.A 0.0828972
Ots_cox1-241.T 0.0810016
Ots_unk9480-51.G 0.0770229
Ots19_46172133.C 0.0736705
Ots_ZR-575.A 0.0735555
Ots_P450.T 0.0683253
Ots6_10904949.C 0.0676942
Ots_cgo24-22.T 0.0672572
Ots_IL11.T 0.0671326
Ots_hnRNPL-533.T 0.0670826
Ots_FGF6B_1.A 0.0623263
Ots_crRAD255-59.C 0.0620136
Ots11_32418659.T 0.0604396
Ots_RAG3.T 0.0603979
Ots_101554-407.G 0.0601383
Ots_HSP90B-100.C 0.0598335
Ots_104415-88.T 0.0598167
Ots_Est1363.T 0.0569970

Okay, we can see that axis 1 outliers are driven by the Greb1L-Rock1 region markers, but axis 2 outliers are driven by a unrelated set of markers.

In any case, let’s remove the outliers and conduct a PCA again.

Outlier Metadata

The table below collects the metadata for the 7 outlier individuals.

kable(genos_2.2 %>%
  filter(sample %in% outlier_inds) %>% 
  mutate(carcass = case_when(str_detect(sample, "OtsCC") ~ "carcass",
                                   TRUE ~ "live")) %>%
    mutate(location = str_to_lower(location)) %>%
  select(sample, carcass, date, Population, location, run, sex, meps, fl, origin) %>%
  arrange(Population, location), caption = "Metadata for Outlier Individuals") %>%
  kable_classic(full_width = T, html_font = "Cambria") %>%
  scroll_box(width = "900px", height = "200px")
Metadata for Outlier Individuals
sample carcass date Population location run sex meps fl origin
OtsCC20NUMP_0020 carcass 10/6/20 NUMP lower fish creek Spring F 745 NA NOR
OtsAC20NUMP_0017 live NA NUMP winchester dam or rock cr hatchery Spring M NA 690 NOR
OtsAC20NUMP_0023 live NA NUMP winchester dam or rock cr hatchery Spring M NA 835 NOR
OtsAC20NUMP_0032 live NA NUMP winchester dam or rock cr hatchery Spring M NA 840 NOR
OtsCC20SUMP_0002 carcass 10/19/20 SUMP bedrock gorge Spring F 590 NA NOR
OtsAC21SUMP_0026 live 5/19/21 SUMP south umpqua falls Spring M NA 615 NOR
OtsAC21SUMP_0059 live 5/28/21 SUMP south umpqua falls Spring M NA 555 HOR

PCA (exclude outliers)

# first scale an center dataet and impute missing data to meanallele frequency
non_outliers <- umpr_genos$sample[!(umpr_genos$sample %in% outlier_inds)]

X2 <- scaleGen(genind_umpr[non_outliers],  NA.method="mean")
#then run pca, keep all PCs
pca2 <- dudi.pca(X2, scale = FALSE, scannf = FALSE, nf = 123)

snp_pcs <- pca2$li#[,c(1:kg)]

#now plot data
snp_pcs %<>%
  rownames_to_column(var="sample") %>%
  left_join(umpr_genos)

ggplot(data = snp_pcs)+geom_point(aes(Axis1, Axis2, color = Population), alpha = 0.5) +theme_classic()

ggplot(data = snp_pcs)+geom_point(aes(Axis2, Axis3, color = Population), alpha = 0.5) +theme_classic()

ggplot(data = snp_pcs)+geom_point(aes(Axis4, Axis5, color = Population), alpha = 0.5) +theme_classic()

# let's get rid of 
plotly::plot_ly(x=snp_pcs$Axis1, y=snp_pcs$Axis2, z=snp_pcs$Axis3, type="scatter3d", mode="markers", color=snp_pcs$Population, alpha = 0.8)
eigs <- as.data.frame(pca2$eig)

eigs %<>%
  rowid_to_column()

ggplot(data = eigs)+geom_bar(aes(rowid, `pca2$eig`), stat = "identity", position = "dodge")+theme_classic()+xlab("PCA axis")+ylab("eigenvalue")+ggtitle("Screeplot")

PCA Summary

No, I don’t suspect there is any substantial structure in the data from PCA results. There is substantial overlap in PC space between our samples of North and South Umpqua spring Chinook salmon.

FST

Let’s estimate the level of differentiation using FST. We will estimate some basic F statistics using hierfstat and also the pairwise Fst using the Weir and Cockerham. (Note 1 that we left the outlier individuals in the dataset for these estimates) (Note 2 that this is a simple calculation of Fstatistics, dataset wide that is unlikely to be reflective of genome-wide neutral varaition becuase of LD and the bias in our panel towards ESTs and adaptive genetic variation)

First some basic F stats

require(hierfstat)
## Loading required package: hierfstat
## Warning: package 'hierfstat' was built under R version 3.6.2
## 
## Attaching package: 'hierfstat'
## The following object is masked from 'package:adegenet':
## 
##     read.fstat
fstat <- genind2hierfstat(genind_umpr)
colnames(fstat) <- c(pop, names(genind_umpr$loc.n.all))

#calculate datset wide basic stats
basicstats <- basic.stats(fstat)

kable(basicstats$overall, caption = "Full Dataset Fstats")
Full Dataset Fstats
x
Ho 0.2833
Hs 0.2905
Ht 0.2917
Dst 0.0012
Htp 0.2929
Dstp 0.0024
Fst 0.0040
Fstp 0.0081
Fis 0.0249
Dest 0.0033

Now pairwise FST using Weir and Cockerham

genet.dist(fstat, method="WC84")
##             NUMP
## SUMP 0.008064787

Summary

As suspected from the PCA results, differentiation between our sample North and South Umpqua River spring run Chinook salmon is low at 0.008.

Conclusion

Overall FST was low (0.008),and PCA did not reveal any clear structure to the data. This suggests there are no strong genetic differences between our samples of North and South Umpqua River spring run Chinook salmon.

Importantly, our GT-seq panel only provides a small snapshot of the genetic variation within and among these groups and other potentially ecologically relevant genetic differences between these groups may still exist.